All Questions

1 vote
1 answer
156 views

Group events close in time into sessions and assign unique session IDs

The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve. The input is a DataFrame of events, each with a user ID and a timestamp. <...
Tobias Hermann
4 votes
1 answer
196 views

Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions

I need help rewriting my code to be less repetitive. I am used to procedural rather than object-oriented coding. My Scala program is for Databricks. How would you combine cmd 3 and 5? Does ...
Dung Tran
2 votes
0 answers
1k views

Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this table needs to be joined against an incoming dataset. Initially I was using "spark sql rlike" ...
Wiki_91
2 votes
2 answers
801 views

Scala app to transpose columns into rows

This is the first application, or really any Scala, I have ever written. So far it functions as I hoped it would. I just found this community and would love some peer review of possible ...
GenericDisplayName
1 vote
1 answer
5k views

Joining Apache Spark data frames, with many conditional substitutions

I am joining two data frames in Spark using Scala. My code looks very ugly because of the multiple when conditions. Can somebody please help me simplify my code? Here is my existing code. ...
Sudarshan kumar
3 votes
0 answers
2k views

Apache Spark compaction script to handle small files in HDFS

I have some use cases where I have small Parquet files in Hadoop, say, 10-100 MB. I would like to compact them so as to have files of at least, say, 100 MB or 200 MB. The logic of my code is to: * find a ...
javadev
3 votes
0 answers
2k views

Adding columns in Spark dataframe based on rules

I have a DataFrame df, which contains the data below: ...
Varun Chadha
2 votes
0 answers
176 views

Reduce sample rate of GPS data based on distance between points

The algorithm needs to reduce an RDD[GPSRecord] based on the distance between several points, e.g. "give me only GPS records when the distance between them exceeds ...
MiguelAraCo
3 votes
1 answer
115 views

Classifying and counting database entries using Scala map and flatMap

I am new to Spark and Scala and I have solved the following problem. I have a table in a database with the following structure: ...
Shams Tabraiz Alam
0 votes
1 answer
6k views

Unit testing Spark transformation on DataFrame

Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data and passes it to a transformation, then makes assertions on the ...
wrschneider
5 votes
0 answers
718 views

RandomForest multi-class classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
Huga
5 votes
1 answer
2k views

Why does LR on Spark run so slowly?

Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on Spark clusters. The settings are: 5 nodes, each node with 8 cores (all the ...
Tim